Runtime Speculative Software-only Fault Tolerance
نویسنده
چکیده
Transient faults are emerging as a critical reliability concern for modern microprocessors. Recently, microprocessors have been designed with lower voltage level ,smaller and faster transistors enabled by improved fabrication technology. A combination of increased density of transistors on chip, reduced noise margin of each transistor, and voltage scaling are making hardware systems more susceptible to transient faults than ever. Both hardware or software solutions have been proposed for transient fault tolerance. The hardware approach typically adds redundant hardware modules to the system, thus requiring extra chip area as well as higher hardware design and verification cost. In addition, the scope and mechanism of fault tolerance are hardwired at design time, which could be suboptimal with the change of deployment environment. Unlike hardware solutions, software-only techniques do not require any specialized hardware extensions and are more flexible with the scope of protection and the change of environment. However, even the best-performing software-only fault tolerance techniques incur significant performance cost. The overhead of prior work comes from doubled register usage, frequent inter-core communication, or barrier synchronizations. These factors prevent existing software techniques from being adopted widely. To address these problems, this dissertation proposes Runtime Software-only Speculative Fault Tolerance (RSFT). The key insights behind this dissertation are: (1) not all values are equally important. Transient faults may alter a transistor’s value, which is never used. Only the values that will affect the externally visible behavior of a program must be verified before being used; (2) Value speculation can efficiently remove data dependences introduced by cross checking values produced in the program and its redundant copy with high confidence, thus significantly improves program runtime performance. RSFT serves as a virtual layer between the application and the underlying platform.
منابع مشابه
Encore: Low-Cost, Fine-Grained Transient Fault Recovery
To meet an insatiable consumer demand for greater performance at less power, silicon technology has scaled to unprecedented dimensions. However, the pursuit of faster processors and longer battery life has come at the cost of device reliability. Given the rise of processor (un)reliability as a first-order design constraint, there has been a growing interest in low-cost, non-intrusive techniques...
متن کاملRuntime Verification for Ultra-Critical Systems
Runtime verification (RV) is a natural fit for ultra-critical systems, where correctness is imperative. In ultra-critical systems, even if the software is fault-free, because of the inherent unreliability of commodity hardware and the adversity of operational environments, processing units (and their hosted software) are replicated, and fault-tolerant algorithms are used to compare the outputs....
متن کاملRuntime Verification in Context: Can Optimizing Error Detection Improve Fault Diagnosis?
Runtime verification has primarily been developed and evaluated as a means of enriching the software testing process. While many researchers have pointed to its potential applicability in online approaches to software fault tolerance, there has been a dearth of work exploring the details of how that might be accomplished. In this paper, we describe how a component-oriented approach to software ...
متن کاملAdaptive software fault tolerance policies with dynamic real-time guarantees
Real-time applications with high dependability requirements demand for fault tolerance strategies. While for small systems with static behaviour policies based on worst case execution times can be used, this is not true for more complex systems, in which worst case execution times are partially unknown or differ drastically from their average execution time. In such cases often only a minimum o...
متن کاملTransient Fault Tolerance via Dynamic Process-Level Redundancy
Transient faults are emerging as a critical concern in the reliability of microprocessors. While hardware reliability techniques are often employed for transient fault tolerance, software techniques represent a more cost-effective and flexible alternative. This paper proposes a software approach to transient fault tolerance which utilizes a run-time system to automatically apply process-level r...
متن کامل